This page last changed on Sep 10, 2008 by stepheneb.

Storing Java jars and classes in git

I want to solve two problems.

  1. I'd like our Java code updates for people running SAIL/OTrunk applications to be MUCH faster.
  2. I'd like to make it as easy as possible to replicate the content from an original repository to child repositories.

I've been using git a great deal recently and have been very impressed.

I think we could get huge improvements by using Java Web Start only to install a minimal app. That app would then use the pure-Java jgit library to fetch, via git, the rest of the resources that Web Start used to deliver.

I also think we could get the same benefit by using git to transfer content.

FYI: the very best intro to git for a programmer is Scott Chacon's Rails Conf talk. He's made it available here on his gitcasts web site as a screencast: http://www.gitcasts.com/posts/railsconf-git-talk

Tests with the versioned OTrunk jar

I wanted to see what the possibilities were if we used git both to create a repository for our Java code and to move differences with its transport mechanism.

I grabbed copies of all 226 versions of the otrunk jars (from: V0.1.0-20070419.221212-115 to: V0.1.0-20080717.124415-532) here: http://jnlp.concord.org/dev/org/concord/otrunk

Altogether these jars added up to 83MB.

I then ran two experiments with two variations.

  1. Create a git repo to hold each jar:
    The jars were copied in sequence and renamed to just 'otrunk.jar' in the process. I committed the single jar in the working dir to git and created a tag consisting of the version number.
  2. Create a git repository to hold the unzipped content from each jar:
    The jars were copied in sequence and renamed to just 'otrunk.jar' in the process. I ran unzip on 'otrunk.jar' and then deleted it. I then committed all the content in the working dir to git and created a tag consisting of the version number.

Initial conditions:

size of otrunk__V0.1.0-20080717.124415-532.jar: 440 K
size of otrunk__V0.1.0-20080717.124415-532.jar unzipped: 1.5 MB
otrunk-jars dir: 83 MB (all 226 jars)

Experiment 1, jars.

jars                 size in MB   notes
otrunk-jars-git      79
otrunk-jars-git      12           after running git gc

Experiment 2, expanded jars.

classes              size in MB   notes
otrunk-classes-git   28
otrunk-classes-git   2.9          after running git gc
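
To make the reduction explicit, the sizes reported above can be turned into ratios against the 83 MB of raw jars. This is only arithmetic on the numbers already listed; the variable and method names are chosen here for illustration.

```ruby
# Size in MB of the directory holding all 226 jars, from the measurements above.
raw_jars_mb = 83.0

# Repository sizes in MB after 'git gc', from the measurements above.
repo_sizes_mb = {
  'otrunk-jars-git'    => 12.0,
  'otrunk-classes-git' => 2.9
}

# How many times smaller each repository is than the raw jar collection.
def reduction_factors(raw_mb, repo_sizes_mb)
  repo_sizes_mb.map { |name, mb| [name, (raw_mb / mb).round(1)] }.to_h
end

reduction_factors(raw_jars_mb, repo_sizes_mb).each do |name, factor|
  puts "#{name}: about #{factor} times smaller than the raw jars"
end
```

That works out to roughly 7 times smaller for the repository of jars and roughly 29 times smaller for the repository of unzipped content.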

The raw differences are also tiny.

Here's the size in kB of the output of git diff --raw, comparing the HEAD of the master branch with the previous commit (master~1) and with the oldest commit (master~225).

puts sprintf('%.1f', `git diff master~1 --raw | wc -c`[/\s*(\d+)/, 1].to_f / 1024) # => 0.4
puts sprintf('%.1f', `git diff master~225 --raw | wc -c`[/\s*(\d+)/, 1].to_f / 1024) # => 24.0

Using jardiff to generate a difference file for the most recent two revisions created a file that was about 24k.

That means using git produced a difference file that was about 50 times smaller than the equivalent difference file generated by the Java web start servlet with jardiff.

Jardiff works by sending only the classes that have changed. In these tests git goes one very big step further by just sending the differences in the classes that have changed.

I don't have accurate numbers for the difference produced by jardiff for the oldest revision because it was larger than the 122k size of the pack.gz version of the jar file.

I also found out you can easily use git to generate a stream containing any revision of a file which can then be piped to a file or a socket.
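
As a minimal sketch of that idea, the method below uses `git show <revision>:<path>` to write one revision of a file to any IO object. The method name `stream_revision` is hypothetical, and the example in the trailing comment assumes the otrunk-jars-git repository and tag naming from the experiments above.

```ruby
require 'open3'

# Copy one file as it existed at a given revision (tag, branch, or commit)
# from the repository in repo_dir to an IO-like object such as a file or socket.
# Returns the number of bytes copied.
def stream_revision(repo_dir, revision, path, out)
  Open3.popen2('git', 'show', "#{revision}:#{path}", chdir: repo_dir) do |stdin, stdout, wait_thread|
    stdin.close
    stdout.binmode
    bytes = IO.copy_stream(stdout, out)
    raise "git show failed for #{revision}:#{path}" unless wait_thread.value.success?
    bytes
  end
end

# Example (assumed repository, tag, and output file names):
#   File.open('otrunk-old.jar', 'wb') do |file|
#     stream_revision('otrunk-jars-git', 'V0.1.0-20070419.221212-115', 'otrunk.jar', file)
#   end
```

Because the destination is just an IO object, the same call works for an open socket in place of the file.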

Here's the ruby file I used to generate the tests:

file: otrunk_jars.rb

#!/usr/bin/env ruby

require 'rubygems'
require 'git'
require 'fileutils' 
include FileUtils::Verbose

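# Split a path like '../otrunk-jars/otrunk__V0.1.0-20080717.124415-532.jar'
# into ['otrunk.jar', 'V0.1.0-20080717.124415-532'].
def jar_name_and_version(path)
  base, version = File.basename(path, '.jar').split('__', 2)
  ["#{base}.jar", version]
end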

rm_rf('otrunk-classes-git')
mkdir('otrunk-classes-git')
cd('otrunk-classes-git') do
  git = Git.init
  otrunk_jars = Dir.glob('../otrunk-jars/*')
  otrunk_jars.each do |jar|
    name, version = jar_name_and_version(jar)
    rm_rf(Dir.entries('.')-%w{. .. .git})
    cp(jar, name)
    `unzip #{name}`
    rm(name)
    git.add('*')
    git.commit_all("adding all the content from the unzipped: #{name}, version: #{version}")
    git.add_tag(version)
  end
  puts "uncompressed size of git dir: \n#{`du -chd0 .git`}"
  puts "running: 'git gc':\n#{`git gc`}"
  puts "compressed size of git dir: \n#{`du -chd0 .git`}"
  puts
end

rm_rf('otrunk-jars-git')
mkdir('otrunk-jars-git')
cd('otrunk-jars-git') do
  git = Git.init
  otrunk_jars = Dir.glob('../otrunk-jars/*')
  otrunk_jars.each do |jar|
    name, version = jar_name_and_version(jar)
    cp(jar, name)
    git.add(name)
    git.commit("adding #{name}, version: #{version}")
    git.add_tag(version)
  end
  puts "uncompressed size of git dir: \n#{`du -chd0 .git`}"
  puts "running: 'git gc':\n#{`git gc`}"
  puts "compressed size of git dir: \n#{`du -chd0 .git`}"
  puts
end

Tests with the resources used by the all-otrunk-snapshot jnlps

I decided to truly abuse git and created a git repository that contains all the resources referenced by ALL 634 of the versioned all-otrunk-snapshot.jnlps located here: http://jnlp.concord.org/dev/org/concord/maven-jnlp/all-otrunk-snapshot/

On the cc jnlp server, in the dir /home/sbannasch/src, I've got these two Ruby scripts:

all-otrunk-jnlps.rb: creates the git repository all-otrunk-classes-git
measure_all_otrunk.rb: measures the performance of this repository

The all-otrunk-classes-git repository contains all the content in all the jars and native libraries referenced by all 634 revisions of all-otrunk-snapshot.jnlps for the last 15 months:

  • from: 0.1.0-20070420.131610 April 20, 2007
  • to: 0.1.0-20080718.202956 July 18, 2008

The content of all the jars and native libraries was unzipped before committing to the git repository. Git is much more efficient at generating diffs and compressing content when it works with a large collection of small files than with a smaller collection of jar files into which the original content has been packed. Each maven jnlp version of the resource set was committed and tagged. In addition, a local branch was made referencing each commit.

The repository is 111 MB (size of the .git dir). The working directory is an additional 226 MB. The content in the working directory is created when a branch or tag is checked out of the repository (from content stored in the '.git' directory).

If you were to clone this git repository the data transferred would be about 111 MB.

When the all-otrunk jnlp is run from Java Web Start, the jar and nativelib resources take about 70 MB in the web start cache.

Here are some measurements of repository performance with respect to starting with tag: 0.1.0-20080718.202956 checked out in the working directory.

First I created local branches for all of the tags I'm testing, each local branch has a name with this form:

  local_<tag>
  

So the commit associated with tag 0.1.0-20080718.174145 is also the head of a local branch named local_0.1.0-20080718.174145.
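
A minimal sketch of how such branches could be created with `git branch <name> <tag>`. The method names are hypothetical; the code only builds the commands and prints them, and the commented line shows how they might be executed inside the repository.

```ruby
# Name of the local branch that mirrors a tag, following the local_<tag> form.
def local_branch_name(tag)
  "local_#{tag}"
end

# Build one 'git branch <name> <tag>' command per tag as an argument array,
# ready to hand to Kernel#system. Nothing is executed here.
def branch_commands(tags)
  tags.map { |tag| ['git', 'branch', local_branch_name(tag), tag] }
end

tags = %w[0.1.0-20080718.174145 0.1.0-20080718.142938]
branch_commands(tags).each { |cmd| puts cmd.join(' ') }

# To actually create the branches, run this from inside the repository:
#   branch_commands(tags).each { |cmd| system(*cmd) or raise "failed: #{cmd.join(' ')}" }
```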

About the data collected:

The diff value is the size of what would be transferred over the network if you were updating from the older tag to the most recent tag.

The 'time to calculate diff' is based on generating the diff between the most recent tag: 0.1.0-20080718.202956 and the selected tag.

The checkout time is the time taken on troy to check out the local branch for that tag, starting from the initial condition of having the local branch (master) for the most recent tag (0.1.0-20080718.202956) checked out.
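
The measure_all_otrunk.rb script itself is not reproduced on this page, so the following is only a sketch of one way to collect these three values for a single tag. The method names `timed_capture` and `measure_tag` are hypothetical, and using `git diff --raw` for the diff size is an assumption carried over from the earlier otrunk jar test; the original script may have measured something different.

```ruby
require 'benchmark'
require 'open3'

# Run a command and return [stdout, elapsed_seconds]. Raises if the command fails.
def timed_capture(*cmd, chdir: Dir.pwd)
  output = nil
  seconds = Benchmark.realtime do
    output, status = Open3.capture2(*cmd, chdir: chdir)
    raise "command failed: #{cmd.join(' ')}" unless status.success?
  end
  [output, seconds]
end

# Measure one row of data for a tag: diff size in kB, time to calculate the
# diff against the most recent tag, and time to check out local_<tag>.
def measure_tag(repo_dir, latest_tag, tag)
  # Assumption: diff size is the byte count of 'git diff --raw' output.
  diff, diff_seconds = timed_capture('git', 'diff', '--raw', latest_tag, tag, chdir: repo_dir)
  _, checkout_seconds = timed_capture('git', 'checkout', '-q', "local_#{tag}", chdir: repo_dir)
  # Return to master so the next measurement starts from the same initial condition.
  timed_capture('git', 'checkout', '-q', 'master', chdir: repo_dir)
  { tag: tag, diff_kb: diff.bytesize / 1024.0,
    diff_seconds: diff_seconds, checkout_seconds: checkout_seconds }
end

# Example (assumes the repository and local_<tag> branches described above):
#   p measure_tag('all-otrunk-classes-git', '0.1.0-20080718.202956', '0.1.0-20080718.174145')
```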

Previous tags: 1, 2, 3, 4, 5

tag                      size of diff   time to calculate diff   time to checkout all 33000 files
0.1.0-20080718.174145    28k            0.240s                   0.36s
0.1.0-20080718.142938    21k            0.240s                   0.34s
0.1.0-20080717.194504    31k            0.240s                   0.36s
0.1.0-20080717.164323    32k            0.260s                   0.36s
0.1.0-20080717.160319    32k            0.250s                   0.37s

Previous tags: 10, 20, 30, 40, 50

tag                      size of diff   time to calculate diff   time to checkout all 33000 files
0.1.0-20080716.203614    49k            0.250s                   0.40s
0.1.0-20080713.114755    55k            0.260s                   0.41s
0.1.0-20080706.085806    68k            0.260s                   0.43s
0.1.0-20080701.210624    89k            0.270s                   0.45s
0.1.0-20080627.160037    97k            0.270s                   0.46s

Previous tags: 100, 200, 300, 400, 500

tag                      size of diff   time to calculate diff   time to checkout all 33000 files
0.1.0-20080530.221435    230k           0.290s                   0.65s
0.1.0-20080306.193401    1683k          0.430s                   1.40s
0.1.0-20080103.150020    2787k          0.620s                   1.86s
0.1.0-20071005.143745    3027k          0.570s                   2.33s
0.1.0-20070711.172752    3557k          0.740s                   2.17s

Let the data in that table sink in a bit ...
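
To put those diff sizes in proportion, here is a small calculation expressing a few of them as a percentage of the roughly 70 MB that the same resources occupy in the web start cache. It assumes that the 'k' in the tables means kB and that 1 MB = 1024 kB; the names are illustrative only.

```ruby
# Approximate size of the web start cache for all-otrunk, from the text above.
# Assumption: 1 MB = 1024 kB, and 'k' in the tables means kB.
web_start_cache_kb = 70 * 1024

# Diff sizes in kB for a few rows of the tables, keyed by how many tags back.
diff_kb_by_tags_back = { 1 => 28, 10 => 49, 100 => 230, 500 => 3557 }

# Percentage that part represents of whole, rounded to two decimal places.
def percent_of(part, whole)
  (100.0 * part / whole).round(2)
end

diff_kb_by_tags_back.each do |tags_back, kb|
  puts "#{tags_back} tags back: #{kb} kB, #{percent_of(kb, web_start_cache_kb)}% of the cache size"
end
```

Even the diff from 500 tags back comes to about 5% of the cache size, and the diff from the previous tag is well under a tenth of a percent.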

Document generated by Confluence on Jan 27, 2014 16:56